Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Yuxin Zhang and 406328706

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.1       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:    2.218 GiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer

patient of interest:

sid <- 10013310
sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") %>%
  filter(subject_id == sid & eventtype != "discharge") 
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_lab <- arrow::open_dataset("labevents_pq") |>
  filter(subject_id == sid) |>
  collect() 
sid_hpcd <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") %>%
  filter(subject_id == sid) %>%
  mutate(chartdate = as.POSIXct(chartdate)) 
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_ipcd <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz") 
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_pcd <- left_join(sid_hpcd , sid_ipcd, by = "icd_code") 
sid_pat <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  filter(subject_id == sid) 
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_ad <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  filter(subject_id == sid) 
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_dgn <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
  filter(subject_id == sid) 
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_ddgn <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz") 
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
top_sid_ddgn <- sid_dgn %>%
  slice_head(n = 3) %>%  
  ungroup() %>%  
  inner_join(sid_ddgn, by = "icd_code") 
  ggplot() +
  geom_segment(data = sid_adt, aes(
    x = intime, 
    xend = outtime,
    y = "ADT",
    yend = "ADT",
    color = careunit,
    linewidth = str_detect(careunit, "(ICU|CCU)")))+
    scale_linewidth_discrete(guide = "none")+
  geom_point(data = sid_lab,
             mapping = aes(
               x = charttime,
               y = "Lab"),
             shape = 3
             ) +
  geom_jitter(data = sid_pcd,
           mapping = aes(
             x = chartdate,
             y = "Procedure",
             shape = long_title
           )) +
  scale_shape_manual(values = c(1:10),
                     labels = unique(sid_pcd$long_title))+
  guides(shape = guide_legend(ncol = 2))+
  labs(
    x = "Calendar Time",
    y = "",
    shape = "Procedure", 
    color = "Care Unit",
    title = str_c(
  "Patient ", sid, ", ",
  sid_pat$gender, ", ",
  sid_pat$anchor_age, " years old, ",
  tolower(sid_ad$race), "\n",
  paste(paste(top_sid_ddgn$long_title[1:3]), collapse = "\n"))
 )+
  scale_y_discrete(
    limits = c("Procedure", "Lab", "ADT"),
  ) +
  theme(
    legend.position = "bottom",
    legend.box = "vertical"
  )
Warning: Using linewidth for a discrete variable is not advised.

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer

sid_ite <- read_csv("~/mimic/icu/d_items.csv.gz") |>
  filter(abbreviation %in% c("HR", "NBPd", "NBPs", "RR", "Temperature F")) 
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_cet <- arrow::open_dataset("chartevents_pq") |>
  filter(subject_id == sid) |>
  collect() |>
  inner_join(sid_ite, by = "itemid") |>
  mutate(value = as.numeric(value)) 
ggplot(sid_cet) +
  geom_line(aes(x = charttime, y =value,group= itemid, 
                    color = abbreviation), na.rm = T) +
  geom_point(aes(x = charttime, y =value,group= itemid, 
                    color = abbreviation)) +
  facet_grid(abbreviation ~ stay_id, scales = "free") +
  labs(x = "Calendar Time", title = paste("Patient", sid, "ICU stays - Vitals")) +
  theme_light() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 20, hjust = 1)
  ) 

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  as_tibble() |>
  print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer

unique_id <- n_distinct(icustays_tble$subject_id)
print(unique_id)
[1] 50920
icu_stays_summary <- icustays_tble %>%
  group_by(subject_id) %>%
  summarise(number_of_stays = n_distinct(stay_id))
icu_stays_summary
# A tibble: 50,920 × 2
   subject_id number_of_stays
        <dbl>           <int>
 1   10000032               1
 2   10000980               1
 3   10001217               2
 4   10001725               1
 5   10001884               1
 6   10002013               1
 7   10002155               3
 8   10002348               1
 9   10002428               4
10   10002430               1
# ℹ 50,910 more rows

There are 50920 unique subject_id and a subject_id can have multiple ICU stays.

ggplot(icu_stays_summary, aes(x = number_of_stays)) +
  geom_bar() +
  labs(title = "Distribution of Number of ICU Stays per Subject",
       x = "Number of ICU Stays",
       y = "Count of Subject IDs") + 
  theme_minimal()

From the plot above, we find that most of the subjects had very few ICU stay, with a steep decline in the number of subjects as the number of ICU stays increases. The tallest bar is at the leftmost side of the graph, representing lowest number of ICU stays is 1 time, and this count is by far the highest.However, there were also some subjects who have multiple ICU stays. The most one had 37 stays.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  as_tibble() |>
  print(width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer

num_of_admission <- admissions_tble %>%
  group_by(subject_id) %>%
  summarise(number_of_admission = n_distinct(hadm_id))
ggplot(num_of_admission, aes(x = number_of_admission)) +
  geom_bar() +
  labs(title = "Distribution of Number of Admissions per Patient",
       x = "Number of Admissions",
       y = "Count of Patients") +
  theme_minimal()

summary(num_of_admission$number_of_admission)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   2.386   2.000 238.000 

From the plots above, we notice that the number of admissions per patient is right-skewed. Most patients have 1 or 2 admissions. There are also some outliers and the maximum number of admissions per patient is 238.

admissions_tble3 <- admissions_tble |>
  mutate(admission_hour = hour(admittime))
ggplot(admissions_tble3, aes(x = admission_hour)) +
  geom_bar() +
  labs(title = "Distribution of Admission Hours", x = "Hour of Admission", y = "Count") +
  theme_minimal()

boxplot(admissions_tble3$admission_hour)

The plot suggests that admissions are correlated with typical daily activity patterns. The most significant peak occurs just after midnight, which could be a result of a systematic recording practice. There are also peaks at 7 am , which could be due to the start of a day job.

admissions_tble1 <- admissions_tble %>%
  mutate(admission_minute = minute(admittime))
ggplot(admissions_tble, aes(x = minute(admittime))) +
  geom_bar() +
  labs(title = "Distribution of Admission Minutes", x = "Minute of Admission", y = "Count") +
  theme_minimal()

It showed peaks at the 0, 15, 30, and 45-minute marks of each hour, suggesting that admissions are more likely to be recorded or occur at these times. This might be due to the way data is collected or certain administrative processes that happen at these times.

admissions_tble2 <- admissions_tble %>%
  mutate(length_of_stay = as.numeric(difftime(dischtime, admittime, units = "days")))

ggplot(admissions_tble2, aes(x = length_of_stay)) +
  geom_bar() +
  labs(title = "Length of Hospital Stay", x = "Length of Stay (Days)", y = "Count") +
  theme_minimal()

The plot indicates that most hospital stays are short, with a rapid decrease in frequency as the length of stay increases. This suggests that while most patients are discharged relatively quickly, a smaller number have very long stays.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  as_tibble() |>
  print(width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer

patients_tble %>%
  ggplot() +
  geom_bar(aes(x = gender)) +
  theme_minimal() +
  labs(title = "Gender Distribution", x = "Gender", y = "Count")

The bar heights illustrate that the number of females is slightly higher than the number of males, but the two are relatively close in count.

patients_tble %>%
  ggplot() +
  geom_bar(aes(x = anchor_age)) +
  theme_minimal() +
  labs(title = "Age Distribution", x = "Age", y = "Count")

The plot reveals a non-uniform pattern, characterized by a peak in the early to mid-20s followed by a gradual decline with increasing age. There’s a slight uptick in the mid-50s before a continuous decline in subsequent age groups. However, an unusual spike occurs around age 90, due to data rounding or binning practices for privacy or data processing purposes.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer

relevant_itemids <- c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)

labevents_filtered <- arrow::open_dataset("labevents_pq") |>
  filter(itemid %in% relevant_itemids, subject_id %in% icustays_tble$subject_id) |> 
  select(subject_id, itemid, valuenum, storetime) |>
  collect() 
icustays_tble1 <- icustays_tble |>
  select(subject_id, intime, stay_id, outtime) 
aggregated_data <- labevents_filtered %>%
  left_join(icustays_tble1, by = "subject_id") %>%
  filter(storetime < intime) %>%
  group_by(subject_id, stay_id, itemid) %>%
  summarize(last_measurement = last(valuenum, order_by = storetime)) %>%
  collect() %>%
  print(width = Inf)
Warning in left_join(., icustays_tble1, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 845 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
`summarise()` has grouped output by 'subject_id', 'stay_id'. You can override
using the `.groups` argument.
# A tibble: 525,174 × 4
# Groups:   subject_id, stay_id [68,467]
   subject_id  stay_id itemid last_measurement
        <dbl>    <dbl>  <int>            <dbl>
 1   10000032 39553978  50882             25  
 2   10000032 39553978  50902             95  
 3   10000032 39553978  50912              0.7
 4   10000032 39553978  50931            102  
 5   10000032 39553978  50971              6.7
 6   10000032 39553978  50983            126  
 7   10000032 39553978  51221             41.1
 8   10000032 39553978  51301              6.9
 9   10000980 39765666  50882             21  
10   10000980 39765666  50902            109  
# ℹ 525,164 more rows
d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz") 
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
item_label <- d_labitems %>%
  filter(itemid %in% relevant_itemids) %>%
  deframe() 
Warning: `x` must be a one- or two-column data frame in `deframe()`.
aggregated_data<- aggregated_data %>%
  mutate(itemid = as.character(itemid))

labevents_tble <- aggregated_data %>%
  pivot_wider(names_from = itemid, values_from = last_measurement,
              names_glue = "{item_label[itemid]}") %>%
  select(subject_id, stay_id, everything())

print(labevents_tble, width = Inf)
# A tibble: 68,467 × 10
# Groups:   subject_id, stay_id [68,467]
   subject_id  stay_id Bicarbonate Chloride Creatinine Glucose Potassium Sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
   Hematocrit `White Blood Cells`
        <dbl>               <dbl>
 1       41.1                 6.9
 2       27.3                 5.3
 3       37.4                 5.4
 4       38.1                15.7
 5       NA                  NA  
 6       39.7                12.2
 7       34.9                 7.2
 8       25.5                17.9
 9       22.4                 9.8
10       39.7                 7.9
# ℹ 68,457 more rows

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer

re_itemids <- c(220045, 220179, 220180, 223761, 220210)

chartevents <- arrow::open_dataset("chartevents_pq") |>
  filter(itemid %in% re_itemids, subject_id %in% icustays_tble$subject_id) |> 
  select(subject_id, itemid, valuenum, charttime) |>
  collect() 
aggregated_data2 <- chartevents %>%
  left_join(icustays_tble1, by = "subject_id") %>%
  filter(charttime >= intime & charttime <= outtime) %>%
  group_by(subject_id, stay_id, itemid) %>%
  summarize(first_measurement = first(valuenum, order_by = charttime)) %>%
  collect() 
Warning in left_join(., icustays_tble1, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 590 of `x` matches multiple rows in `y`.
ℹ Row 11 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
`summarise()` has grouped output by 'subject_id', 'stay_id'. You can override
using the `.groups` argument.
d_items <- read_csv("~/mimic/icu/d_items.csv.gz")
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
item_abbr <- d_items %>%
  filter(itemid %in% re_itemids) %>%
  deframe() 
Warning: `x` must be a one- or two-column data frame in `deframe()`.
aggregated_data2 <- aggregated_data2 %>%
  mutate(itemid = as.character(itemid))

chartevents_tble <- aggregated_data2 %>%
  pivot_wider(names_from = itemid, values_from = first_measurement,
              names_glue = "{item_abbr[itemid]}") %>%
  select(subject_id, stay_id, everything())

print(chartevents_tble, width = Inf)
# A tibble: 73,164 × 7
# Groups:   subject_id, stay_id [73,164]
   subject_id  stay_id `Heart Rate` `Non Invasive Blood Pressure systolic`
        <dbl>    <dbl>        <dbl>                                  <dbl>
 1   10000032 39553978           91                                     84
 2   10000980 39765666           77                                    150
 3   10001217 34592300           96                                    167
 4   10001217 37067082           86                                    151
 5   10001725 31205490           55                                     73
 6   10001884 37510196           38                                    180
 7   10002013 39060235           80                                    104
 8   10002155 31090461           94                                    118
 9   10002155 32358465           98                                    109
10   10002155 33685454           68                                    126
   `Non Invasive Blood Pressure diastolic` `Respiratory Rate`
                                     <dbl>              <dbl>
 1                                      48                 24
 2                                      77                 23
 3                                      95                 11
 4                                      90                 18
 5                                      56                 19
 6                                      12                 10
 7                                      70                 14
 8                                      51                 18
 9                                      65                 23
10                                      61                 18
   `Temperature Fahrenheit`
                      <dbl>
 1                     98.7
 2                     98  
 3                     97.6
 4                     98.5
 5                     97.7
 6                     98.1
 7                     97.2
 8                     96.9
 9                     97.7
10                     95.9
# ℹ 73,154 more rows

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer

icu_patients <- icustays_tble %>%
  left_join(patients_tble, by = "subject_id") %>%
  mutate(age_intime = year(intime)-anchor_year + anchor_age) %>%
  filter(age_intime >= 18) %>%
  print(width = Inf)
# A tibble: 73,181 × 14
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los gender anchor_age anchor_year anchor_year_group
   <dttm>              <dbl> <chr>       <dbl>       <dbl> <chr>            
 1 2180-07-23 23:50:47 0.410 F              52        2180 2014 - 2016      
 2 2189-06-27 20:38:27 0.498 F              73        2186 2008 - 2010      
 3 2157-11-21 22:08:00 1.12  F              55        2157 2011 - 2013      
 4 2157-12-20 14:27:41 0.948 F              55        2157 2011 - 2013      
 5 2110-04-12 23:59:56 1.34  F              46        2110 2011 - 2013      
 6 2131-01-20 08:27:30 9.17  F              68        2122 2008 - 2010      
 7 2160-05-19 17:33:33 1.31  F              53        2156 2008 - 2010      
 8 2131-03-10 18:09:21 0.859 F              80        2128 2008 - 2010      
 9 2129-08-10 17:02:38 6.18  F              80        2128 2008 - 2010      
10 2130-09-27 22:13:41 3.89  F              80        2128 2008 - 2010      
   dod        age_intime
   <date>          <dbl>
 1 2180-09-09         52
 2 2193-08-26         76
 3 NA                 55
 4 NA                 55
 5 NA                 46
 6 2131-01-20         77
 7 NA                 57
 8 2131-03-10         83
 9 2131-03-10         81
10 2131-03-10         82
# ℹ 73,171 more rows
mimic_icu_cohort <- icu_patients %>%
  left_join(admissions_tble, by = c("subject_id","hadm_id")) %>%
  left_join(labevents_tble, by = c("subject_id","stay_id")) %>%
  left_join(chartevents_tble, by = c("subject_id","stay_id")) 
mimic_icu_cohort
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 35 more variables: outtime <dttm>, los <dbl>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>,
#   age_intime <dbl>, admittime <dttm>, dischtime <dttm>, deathtime <dttm>,
#   admission_type <chr>, admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>, …

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the first vital measurements within the ICU stay

  • Length of ICU stay los vs first ICU unit

Answer

race

demographic_func <- function(data, demographic) {
  data %>%
    group_by(!!sym(demographic)) %>%
    summarize(mean_los = mean(los, na.rm = TRUE), 
              median_los = median(los, na.rm = TRUE),
              n = n()) %>%
    ungroup()
}

race_summary <- demographic_func(mimic_icu_cohort, "race")

race_summary %>% 
  group_by(race) %>% 
  summarise(
    count = n(), 
    mean = mean(mean_los,na.rm = TRUE),
    min = min(mean_los,na.rm = TRUE), 
    max = max(mean_los,na.rm =TRUE))
# A tibble: 33 × 5
   race                          count  mean   min   max
   <chr>                         <int> <dbl> <dbl> <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE     1  4.46  4.46  4.46
 2 ASIAN                             1  3.49  3.49  3.49
 3 ASIAN - ASIAN INDIAN              1  4.20  4.20  4.20
 4 ASIAN - CHINESE                   1  3.36  3.36  3.36
 5 ASIAN - KOREAN                    1  4.23  4.23  4.23
 6 ASIAN - SOUTH EAST ASIAN          1  3.04  3.04  3.04
 7 BLACK/AFRICAN                     1  3.82  3.82  3.82
 8 BLACK/AFRICAN AMERICAN            1  3.28  3.28  3.28
 9 BLACK/CAPE VERDEAN                1  3.22  3.22  3.22
10 BLACK/CARIBBEAN ISLAND            1  3.61  3.61  3.61
# ℹ 23 more rows
ggplot(mimic_icu_cohort, aes(x = race, y = los, fill = race)) +
  geom_boxplot(width = 0.7) +
  labs(title = "Length of ICU Stay by Race",
       x = "Race",
       y = "Length of Stay (days)") +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) 

The distribution of stay lengths for each racial group is spread out, with most data points concentrated at the lower end of the scale, indicating shorter ICU stays with a mean of 2.65 to 4.46 days. AMERICAN INDIAN/ALASKA NATIVE have a longest length of stay and HISPANIC/LATINO - MEXICAN have the least. However, there are outliers in almost every racial category that show much longer ICU stays.

Marita statu

ggplot(mimic_icu_cohort, aes(x = marital_status, y = los)) +
  geom_boxplot(width = 0.7) +
  labs(title = "Length of ICU Stay by Marita status",
       x = "Marita status",
       y = "Length of Stay (days)")

marital_summary <- demographic_func(mimic_icu_cohort, "marital_status")

marital_summary %>% 
  group_by(marital_status) %>% 
  summarise(
    count = n(), 
    mean = mean(mean_los,na.rm = TRUE),
    min = min(mean_los,na.rm = TRUE), 
    max = max(mean_los,na.rm =TRUE))
# A tibble: 5 × 5
  marital_status count  mean   min   max
  <chr>          <int> <dbl> <dbl> <dbl>
1 DIVORCED           1  3.39  3.39  3.39
2 MARRIED            1  3.46  3.46  3.46
3 SINGLE             1  3.40  3.40  3.40
4 WIDOWED            1  3.13  3.13  3.13
5 <NA>               1  4.28  4.28  4.28

Each category shows a variety of lengths of stay, with numerous outliers indicating patients who had exceptionally long ICU stays.

Median Stay: The median length of stay across different marital statuses appears to be similar, as indicated by the line within each box.

Outliers: All categories show outliers; however, it’s noteworthy that the ‘NA’ category seems to have fewer outliers, which might be due to a smaller sample size or less variability in this group.

Interquartile Range: The IQRs appear to be fairly consistent across the categories, suggesting a similar distribution of stay lengths among the different marital statuses.

Gender

ggplot(mimic_icu_cohort, aes(x = gender, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay by Gender",
       x = "Gender",
       y = "Length of Stay (days)")

The plot shows that most male and female are similar in terms of length of stay, with most data points concentrated at the lower end of the scale, indicating shorter ICU stays. However, considering the outliers, male tend to have more ones in long ICU stays than female.

Age at intime

ggplot(mimic_icu_cohort, aes(x = age_intime, y = los)) +
  geom_point() +
  labs(title = "Length of ICU Stay by Age",
       x = "Age at intime",
       y = "Length of Stay (days)")

The data points are heavily concentrated at the lower end of the length of stay, indicating that most patients have shorter ICU stays, typically less than 30 days. Patients of a wide range of ages are represented, with a dense clustering of ICU stays for patients in their middle to late adult years.

Last Available Lab Measurements Before ICU Stay

mimic_icu_cohort %>%
  select(29:36, los) %>%  
  gather(Measurement, Last_Lab_Value, -los) %>%  
  ggplot(aes(x = Last_Lab_Value, y = los)) +
    geom_point(alpha = 0.5) + 
    facet_wrap(~ Measurement, scales = "free") + 
    labs(title = "Length of ICU Stay vs Last Available Lab Measurements Before ICU Stay",
         x = "Last Lab Measurement Value",
         y = "Length of Stay (days)") +
  theme_minimal()
Warning: Removed 60686 rows containing missing values (`geom_point()`).

Observations from the plots are as follows:

Bicarbonate: The scatter plot shows a dense clustering of stays around the normal range of bicarbonate levels, with some outliers showing both higher bicarbonate levels and longer stays.

Chloride: This plot also shows a dense concentration of points around the typical chloride value range, with some outliers extending towards higher values and longer stays.

Creatinine: The data points are heavily concentrated at lower creatinine values with most stays being short. There are outliers with higher values, some of which correspond to longer stays.

Glucose: This plot shows a broad distribution of glucose levels, with a heavy concentration of short stays. There are numerous outliers with both high glucose levels and longer lengths of stay.

Hematocrit: A dense clustering is observed at normal hematocrit levels with most stays being short. There are scattered outliers across a range of hematocrit values.

Potassium: Most of the data points are concentrated around the normal potassium levels, with a few outliers having longer lengths of stay.

Sodium: The sodium plot shows a dense central clustering with a few outliers. Most of the stays are short, but there are some longer stays associated with both lower and higher sodium levels.

White Blood Cells (WBC): There’s a dense clustering at the lower WBC counts with short stays, and several outliers indicating higher WBC counts with longer stays.

The overall pattern in these scatter plots suggests that for the majority of patients, the last lab measurement values taken before ICU admission are within typical ranges and are associated with shorter lengths of ICU stay. The presence of outliers in each plot could indicate cases with more severe underlying conditions or complications that lead to both abnormal lab values and longer ICU stays.

First vital Measurements Within ICU Stay

mimic_icu_cohort %>%
  select(37:41, los) %>%  
  gather(Measurements, First_vital_Value, -los) %>%  
  ggplot(aes(x = First_vital_Value, y = los)) +
    geom_point(alpha = 0.5) + 
    facet_wrap(~ Measurements, scales = "free") + 
    labs(title = "Length of ICU Stay vs First vital Measurements Within ICU Stay",
         x = "First vital Measurements",
         y = "Length of Stay (days)") +
  theme_minimal()
Warning: Removed 3440 rows containing missing values (`geom_point()`).

Observations from the plots are as follows:

Heart Rate: The data points are densely populated at lower lengths of stay, with a few outliers extending up to 90 days. The heart rate values range widely, from low to high, but the majority of stays are short, regardless of the heart rate.

NIBP Diastolic: Most of the data points are densely populated at lower lengths of stay, with a few outliers.

NIBP Systolic: Systolic pressure points are heavily clustered at shorter lengths of stay with a broad range of values. There are outliers with long stays and varying systolic readings.

Respiratory Rate: Respiratory rate are heavily clustered at shorter lengths of stay with a broad range of values. There is a small number of outliers with higher respiratory rates associated with longer stays.

Temperature Fahrenheit: The temperature readings are densely clustered at the normal temperature range, with the length of stay predominantly short. There are outliers with extreme temperature readings associated with both short and longer lengths of stay.

Overall, the plots suggest that for most patients in this cohort, the first vital measurements do not show a clear correlation with the length of ICU stay, which is concentrated at the lower end (indicating shorter stays).

First ICU Unit

ggplot(mimic_icu_cohort, aes(x = first_careunit, y = los)) +
  geom_boxplot() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  labs(title = "Length of ICU Stay by First ICU Unit",
       x = "ICU Unit",
       y = "Length of Stay (days)")

The median length of ICU stay tends to be similar across different units, with the middle 50% of patients typically staying within a range of approximately up to 10 days.

There are outliers in each ICU unit category, with some stays significantly longer than the median, extending up to and beyond 90 days.

The distribution of lengths of stay seems to be similar across most units, but there may be a slightly higher number of outliers in certain units like the Neuro Stepdown and the Trauma SICU, suggesting that patients in these units might occasionally experience much longer stays.